Install required packages
First, we install the required packages for the analysis: tidyr, dplyr, ggplot2, GGally, and gridExtra (for arranging multiple plots)
install.packages("tidyr", repos = "http://cloud.r-project.org")
install.packages("dplyr", repos = "http://cloud.r-project.org")
install.packages("ggplot2", repos = "http://cloud.r-project.org")
install.packages("GGally", repos = "http://cloud.r-project.org")
install.packages("gridExtra", repos = "http://cloud.r-project.org")
and make the packages available
library(tidyr); library(dplyr); library(ggplot2); library(gridExtra)
Read data from data folder:
alc <- read.table("C:/Users/palme/Documents/IODS-project/data/alc.csv", sep=",", header=TRUE)
Glimpse at the alc data
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
Dataset information:
This dataset contains measures of student achievement in two Portuguese schools. The data include student grades along with demographic, social, and school-related features as additional variables. The dataset provides average performance in two distinct subjects: mathematics and Portuguese language.
The dataset consists of 35 variables and 382 observations.
Variables:
- school - student’s school
- sex - student’s sex
- age - student’s age
- address - urban or rural
- famsize - family size
- Pstatus - parent’s cohabitation status
- Medu - mother’s education
- Fedu - father’s education
- Mjob - mother’s job
- Fjob - father’s job
- reason - reason to choose this school
- guardian - student’s guardian
- traveltime - home to school travel time
- studytime - weekly study time
- failures - number of past class failures
- schoolsup - extra educational support
- famsup - family educational support
- paid - extra paid classes
- activities - extra-curricular activities
- nursery - attended nursery school
- higher - wants to take higher education
- internet - Internet access at home
- romantic - with a romantic relationship
- famrel - quality of family relationships
- freetime - free time after school
- goout - going out with friends
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - health status
- absences - number of absences
- G1 - first period average grade (math and Portuguese)
- G2 - second period average grade (math and Portuguese)
- G3 - final period average grade (math and Portuguese)
- alc_use - average of Dalc and Walc
- high_use - TRUE if alc_use is higher than 2, FALSE otherwise
Analysis of the causes of high alcohol consumption
First we explore the data by drawing a bar plot of each variable:
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
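Note that gather() still works but has been superseded in newer versions of tidyr. A sketch of the same plot using pivot_longer() (the explicit coercion to character is needed because the columns mix factor, integer, and logical types; gather() does the same coercion silently):

```r
# same bar-plot grid using pivot_longer() instead of gather()
alc %>%
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) %>%
  ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
```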

Main hypothesis
There is quite a lot of variation in respondents’ free time and study time, so we use these variables, along with age and sex, as explanatory variables for alcohol consumption. The main hypothesis is that respondents with a lot of free time also have time to consume alcohol. Moreover, respondents who report many hours of study may choose not to consume alcohol, since drinking might impair their ability to study. In addition, age might be positively correlated with alcohol consumption, as older respondents drink more than their younger counterparts. Finally, there might be gender differences in alcohol consumption.
We study the relationship between high_use and studytime, age, freetime, and sex. First, we produce summary statistics by group:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean(studytime), mean(age), mean(freetime))
## # A tibble: 4 x 6
## # Groups: sex [2]
## sex high_use count `mean(studytime)` `mean(age)` `mean(freetime)`
## <fct> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 156 2.34 16.6 2.93
## 2 F TRUE 42 2 16.5 3.36
## 3 M FALSE 112 1.88 16.3 3.39
## 4 M TRUE 72 1.64 17.0 3.5
Graphical illustration of data:
g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) #plot high_use and age, separate by sex
g1 + geom_boxplot() + ggtitle("Age by alcohol consumption and sex") # define the plot as a boxplot and draw it

g2 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) # plot high_use and studytime by sex
g2 + geom_boxplot() + ggtitle("Study time by alcohol consumption and sex") # define the plot as a boxplot and draw it

g3 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) #plot of high_use and freetime
g3 + geom_boxplot() + ggtitle("Free time by alcohol consumption and sex") # define the plot as a boxplot and draw it

Main findings:
- The mean age of males who report high alcohol use is higher than that of males who report low use.
- The mean age of females who report high alcohol use is slightly lower than that of females who consume less, although the mean ages of the two groups are almost the same.
- Males and females who report high alcohol use have, on average, more free time than those who consume less alcohol.
- Males and females who report high alcohol use study, on average, less than those who consume less alcohol.
- The ratio of high to low alcohol users is larger among males than females (72/112 ≈ 0.64 vs 42/156 ≈ 0.27); equivalently, about 39% of males versus 21% of females report high use.
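The shares of high use within each sex can also be computed directly (mean() of a logical vector gives the proportion of TRUE values):

```r
# share of respondents reporting high alcohol use, by sex
alc %>% group_by(sex) %>% summarise(share_high = mean(high_use))
```

From the counts above this should give roughly 0.21 for females (42/198) and 0.39 for males (72/184).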
Logistic regression
Run a logistic regression with high_use as the dependent variable and studytime, age, freetime, and sex as explanatory variables:
m <- glm(high_use ~ age + studytime + freetime + sex, data = alc, family = "binomial")
The summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + studytime + freetime + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6526 -0.8509 -0.6602 1.1301 2.2095
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.8768 1.7790 -2.741 0.00612 **
## age 0.2290 0.1011 2.265 0.02349 *
## studytime -0.4697 0.1602 -2.931 0.00337 **
## freetime 0.2503 0.1229 2.037 0.04166 *
## sexM 0.5854 0.2482 2.359 0.01833 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 431.41 on 377 degrees of freedom
## AIC: 441.41
##
## Number of Fisher Scoring iterations: 4
All coefficients are statistically significant at the 5% level, suggesting that the chosen variables are associated with high alcohol consumption.
Compute the Odds Ratios (OR)
OR <- coef(m) %>% exp
and the corresponding 95% confidence intervals
CI <- confint(m) %>% exp
The Odds Ratios and confidence intervals are provided in the table below:
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.007621647 0.0002191106 0.2381440
## age 1.257384089 1.0338141044 1.5381420
## studytime 0.625158887 0.4522877516 0.8490039
## freetime 1.284393173 1.0119391550 1.6399589
## sexM 1.795640894 1.1056484552 2.9303141
The odds ratios can be interpreted as the multiplicative change in the odds of high alcohol use for a one-unit increase in the explanatory variable. For example, the regression coefficient for age is 0.2290, so a one-year increase in age multiplies the odds of high alcohol use by exp(0.2290) ≈ 1.26.
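This exponentiation can be checked directly against the OR table:

```r
# odds ratio for age, computed from the fitted coefficient
exp(coef(m)["age"])  # exp(0.2290) is approximately 1.26
```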
To summarize, the amount of free time, age, and being male increase the odds of high alcohol consumption, whereas the amount of study time decreases the odds of high alcohol use, as hypothesized above.
Exploring the predictive power of the model
First, we compute the predicted probabilities and turn them into class predictions
probabilities <- predict(m, type = "response") # predict() the probability of high_use
alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use
and inspect the predictions for the last 10 observations
select(alc, studytime, freetime, age, sex, high_use, probability, prediction) %>% tail(10) # see the last ten original classes, predicted probabilities, and class predictions
## studytime freetime age sex high_use probability prediction
## 373 1 3 19 M FALSE 0.5845170 TRUE
## 374 1 4 18 M TRUE 0.5896690 TRUE
## 375 3 3 18 F FALSE 0.1958322 FALSE
## 376 1 4 18 F FALSE 0.4445380 FALSE
## 377 3 4 19 F FALSE 0.2822698 FALSE
## 378 2 3 18 F FALSE 0.2803350 FALSE
## 379 2 2 18 F FALSE 0.2327073 FALSE
## 380 2 1 18 F FALSE 0.1910235 FALSE
## 381 1 4 17 M TRUE 0.5333414 TRUE
## 382 1 4 18 M TRUE 0.5896690 TRUE
Predictions vs actual values
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 255 13
## TRUE 95 19
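The training accuracy, and hence the error rate computed below, follows directly from this table:

```r
# proportion of correct predictions from the confusion table
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
sum(diag(tab)) / sum(tab)  # (255 + 19) / 382, approximately 0.717
```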
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) # plot 'high_use' versus 'probability' in 'alc'
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66753927 0.03403141 0.70157068
## TRUE 0.24869110 0.04973822 0.29842932
## Sum 0.91623037 0.08376963 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
There is a high number of false negatives: in 95 cases the model predicts low alcohol use for a respondent who actually reports high use, whereas the number of false positives is low (13). This means the model identifies only a small share of the actual high users, so its predictive ability for the positive class is quite low. The balance could be adjusted by lowering the classification threshold below 0.5.
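A sketch of such an adjustment, using an illustrative cutoff of 0.3 (the value is arbitrary, not tuned):

```r
# lower the classification threshold to trade false negatives
# for false positives; 0.3 is an arbitrary illustrative cutoff
alc <- mutate(alc, prediction_low = probability > 0.3)
table(high_use = alc$high_use, prediction = alc$prediction_low)
```

A lower cutoff labels more respondents as high users, reducing false negatives at the cost of more false positives.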
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2827225
On average the model makes wrong predictions about 28% of the time (a proportion of 0.283), which is better than the average error from random guessing:
# compute the average number of wrong predictions in the (training) data using random predictions
loss_func(class = alc$high_use, prob = runif(length(alc$high_use))>0.5)
## [1] 0.4842932
As the number of predictions grows, the average share of wrong predictions from random guessing approaches 0.5.
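This can be illustrated with a quick simulation (the 0.30 share of TRUE values roughly mimics high_use, but any share gives the same expected error for a fair coin-flip guess):

```r
# random guessing against a target with ~30% TRUE values:
# expected error rate is 0.5 * 0.7 + 0.5 * 0.3 = 0.5
set.seed(123)
n <- 1e5
target <- runif(n) < 0.30
guess <- runif(n) > 0.5
mean(target != guess)  # close to 0.5
```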